With recent advances in technology, computer programs can now approach the decision-making
capability of the human brain in almost all healthcare systems. Implementing a convolutional
neural network (CNN) together with extreme gradient boosting (XGBoost) is expected to improve
the accuracy of breast cancer detection. The aims of this research were to i) determine the
stages of CNN-XGBoost integration in the diagnosis of breast cancer and ii) calculate the
accuracy of the CNN-XGBoost integration in breast cancer detection. A CNN with XGBoost as the
classifier was used, combined with transfer learning and data augmentation. After accuracy
results were obtained through transfer learning, the final layer was connected to the XGBoost
classifier. An interface for the evaluation process was built using the Python programming
language and the Django platform. The results were: i) the stages of CNN-XGBoost integration
on histopathology images for breast cancer detection were identified, and ii) the CNN-XGBoost
integration achieved a higher level of accuracy for breast cancer detection. In conclusion,
breast cancer can be detected from histopathological images by integrating CNN and XGBoost,
and this combination can enhance the accuracy of breast cancer detection.
1. INTRODUCTION

In almost every field, including healthcare, computer programs now rival the human brain's
ability to make decisions. Advances in computing can therefore be used by humans to uphold
human dignity. Computers can now imitate aspects of human reasoning, particularly when solving
problems in healthcare. With the latest technological advances, computer programs can be used
to detect breast cancer. According to the American Cancer Society (ACS), breast cancer is the
most common type of cancer in women. An estimated 281,550 women were diagnosed with breast
cancer and 43,600 died from it in 2021 [1]. This study is expected to raise public awareness
of the dangers of cancer. Cancer diagnosis is an active topic in healthcare. Detailed
descriptions of cancer characteristics and cancer study information can be obtained thanks to
advances in information technology, software, and hardware. Previously, cancer detection
tended to rely on digital image methods, case-based reasoning, and the certainty factor method.
Several researchers have since turned to data mining and machine learning models for breast
cancer prediction to cope with the rapidly growing volume of cancer feature data and
knowledge. Machine learning is a computational strategy that uses prior experience to improve
performance or make accurate predictions [2]. Using a feedback signal as guidance, machine
learning searches for a useful representation of raw data within a predefined space of
possibilities. These concepts make it feasible to automate a wide range of cognitive tasks,
from voice commands to driving a car [3]. Deep learning can automate the discovery of features
or effective representations for various tasks, including transferring knowledge from one task
to another [4]. According to LeCun et al. [5], a deep learning system learns to classify
directly from images, text, or sound. After being trained on a large amount of labeled data,
the computer converts an image's intensity values into a feature vector, which a classifier
can use to find and categorize patterns in the input.

Deep learning is a machine learning methodology in which computer models learn to classify
text, images, or audio directly. The models are built using a convolutional neural network
(CNN) architecture with many layers and are trained on large datasets. In medical imaging, deep
learning is used to detect cancer cells automatically [6]. The CNN is a type of deep learning
model that is often applied to image data [7], [8] and can be used to identify and recognize
objects in images. In general, a CNN is not very different from earlier neural networks.
According to [9], a CNN is composed primarily of convolutional layers, which are built from a
basic building block known as a convolution. A convolution combines the values of the pixels
in a small area of an image into a single output pixel.
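
The convolution described above can be illustrated with a minimal sketch in plain Python. The function name `conv2d_valid`, the toy image, and the 2 x 2 edge filter are assumptions made for illustration only; they are not taken from the study. As in most CNN libraries, the filter is applied without flipping (cross-correlation), with no padding and a stride of 1.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the kh x kw neighbourhood becomes one output pixel.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4 x 4 image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds to a dark-to-bright vertical edge.
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the 2 x 2 window straddles the edge, which is the sense in which a small neighbourhood is reduced to a single value.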

A CNN can include dozens or even hundreds of layers, each learning to recognize different
image features. In image processing, filters are applied to each training image at different
resolutions, and the output of each convolved image is used as input to the next layer. The
filters can start with simple features such as intensity and edges and increase in complexity
with layer depth until they define features that uniquely identify an object [10], [11].
Because of the growing interest in deep learning, the CNN has become the most frequently used
approach for image analysis and classification, and it has produced state-of-the-art results
in a variety of classification tasks [12]. However, an insufficient dataset may reduce the
accuracy of a CNN. This issue can be addressed by increasing the quantity of data used in the
training process, which is in line with [13]. To enlarge the training set, [14], [15] used data
augmentation. Data augmentation alters an image in such a way that the machine perceives the
modified image as a different image, while people can still recognize it as the same image. It
thus increases the number of images in a dataset by transforming the original images. Data
augmentation improves the performance of the proposed models, particularly for classes that
are still classified incorrectly [16].

Another option for overcoming an insufficient dataset, according to [17]-[19], is transfer
learning. Transfer learning reuses features previously learned on a large dataset and applies
them to a new dataset. Using this approach, [20] increased accuracy by 18.3%. Transfer learning
can also reduce computing time and mitigate the problem of a small training dataset. Ren et
al. [21] improved performance on the Modified National Institute of Standards and Technology
(MNIST) and Canadian Institute for Advanced Research (CIFAR-10) datasets using a CNN as the
feature extractor and extreme gradient boosting (XGBoost) as the predictor, reaching an
accuracy of 99.22% on MNIST and 80.77% on CIFAR-10.

XGBoost is a robust and effective machine learning technique for tree boosting that has been
widely used in many disciplines to obtain state-of-the-art results on a range of data problems
[22]. XGBoost is a fast implementation of the gradient boosting algorithm, combining good speed
with high precision. Scalability in different scenarios is the most significant factor behind
XGBoost's success, and it results from several algorithmic optimizations, including a novel
tree learning algorithm for handling sparse data. This success is reflected in XGBoost
becoming one of the most widely used machine learning techniques [23]. In this work, the
XGBoost classifier is applied on top of the CNN to generate the image classification results.
Based on this background, the research questions of this article are as follows. How are
transfer learning and data augmentation applied with the XGBoost classifier in a CNN for the
early detection of breast cancer? What level of accuracy is obtained for the early detection
of breast cancer with a CNN that uses transfer learning and the XGBoost classifier? The
novelty of this article is the use of a machine learning model with a CNN and XGBoost as the
classifier, together with a test application developed using the Django framework, the Python
language, hypertext markup language (HTML), cascading style sheets (CSS) with Bootstrap,
TensorFlow 2.0, and Keras 2.3.1. The expected contribution is to show that data augmentation,
transfer learning, and the XGBoost classifier in a CNN can improve the accuracy of breast
cancer detection with the residual network (ResNet50) architecture, which can serve as a
reference for other researchers who conduct breast cancer detection studies.


2. THEORETICAL BASIS 

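According to [24], the CNN-XGBoost learning algorithm is described as follows. Let
$X = \{(x_j, y_j) \mid 1 \le j \le M\}$ be the training data set, where $M$ denotes its size,
$x_j = [x_1, x_2, \ldots, x_N]$ is a vector of $N$ features in $\mathbb{R}^{N}$ or
$\mathbb{R}^{\sqrt{N} \times \sqrt{N}}$, and $y_j$ is the label of $x_j$.
a. Begin by populating the training dataset, $X = \{(x_j, y_j) \mid 1 \le j \le M\}$.
b. Increase the number of components in each training item $x_j$, if required, so that the new
data item can be reshaped into a square matrix of entries, $\sqrt{N} \times \sqrt{N}$.
c. Convert $x_j$ to tensor format, $(\sqrt{N}, \sqrt{N}, 1)$.
d. Configure the convolutional parameters for feature learning: the number of convolutional
layers, $L$; the output depth of the convolutional layers, $z$; the filter sizes for every
layer, $K^{(l)}$; and the filter strides, $S^{(l)}$.
e. Compute the convolutions for every layer $l$ in $1 \ldots L$ to produce the output
$Y_j^{(l)}$ of layer $l$:

$$Y_j^{(l)} = B_j^{(l)} + \sum_{i=1}^{f^{(l-1)}} K_{i,j}^{(l)} * Y_i^{(l-1)}$$

where $B_j^{(l)}$ is a bias term, $K_{i,j}^{(l)}$ is the filter connecting feature map $i$ of
layer $l-1$ to feature map $j$ of layer $l$, $f^{(l-1)}$ is the number of feature maps in layer
$l-1$, and $*$ denotes convolution.
f. Reshape $Y^{(L)}$, of shape $(\sqrt{N}, \sqrt{N}, z^{(L)})$, to a vector $Y_V$.
g. Create new training data for the class prediction layer:

$$X_{new} = \{(Y_{V_j}, y_j) \mid 1 \le j \le M\}$$

h. Set the parameters for the prediction phase: the overall number of trees, $K_X$; the
regularization parameters, $\gamma$ and $\lambda$; the column subsampling parameter; the
maximum tree depth; and the learning rate.
i. Evaluate the output class labels:

$$\hat{y}_j = \sum_{k=1}^{K_X} f_k(Y_{V_j}), \quad f_k \in F$$

where $F = \{f(Y_V) = w_{q(Y_V)}\}$ $(q: \mathbb{R}^{m} \to T,\ w \in \mathbb{R}^{T})$.
j. Determine the optimal leaf weight for the best tree structure. In the standard XGBoost
formulation [22], with $I_u$ the set of training items assigned to leaf $u$ and $g_j$, $h_j$
the first- and second-order derivatives of the loss for item $j$:

$$w_u^{*} = -\frac{\sum_{j \in I_u} g_j}{\sum_{j \in I_u} h_j + \lambda}$$

k. Using the scoring function, compute the quality of the tree structure $q$:

$$\tilde{\mathcal{L}}(q) = -\frac{1}{2} \sum_{u=1}^{T} \frac{\left(\sum_{j \in I_u} g_j\right)^{2}}{\sum_{j \in I_u} h_j + \lambda} + \gamma T$$

$T$ denotes the number of leaves on the tree.
l. Determine the best splitting points, where a node with item set $I$ is split into left and
right sets $I_L$ and $I_R$:

$$\mathcal{L}_{split} = \frac{1}{2}\left[\frac{\left(\sum_{j \in I_L} g_j\right)^{2}}{\sum_{j \in I_L} h_j + \lambda} + \frac{\left(\sum_{j \in I_R} g_j\right)^{2}}{\sum_{j \in I_R} h_j + \lambda} - \frac{\left(\sum_{j \in I} g_j\right)^{2}}{\sum_{j \in I} h_j + \lambda}\right] - \gamma$$

m. Terminate.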
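
Steps j to l can be made concrete with a small numeric sketch. It assumes the leaf-weight and split-gain expressions of the standard XGBoost formulation [22], namely w* = -G / (H + lambda) for a leaf and gain = 1/2 [G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - G^2 / (H + lambda)] - gamma for a split, with G and H the sums of first- and second-order loss derivatives. The logistic loss, the four toy labels, and all function names are assumptions for illustration.

```python
def grad_hess_logistic(y_true, p_pred):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    g = [p - y for y, p in zip(y_true, p_pred)]
    h = [p * (1.0 - p) for p in p_pred]
    return g, h


def leaf_weight(g, h, idx, lam):
    """Optimal leaf weight: -G / (H + lambda) over the items in the leaf."""
    G = sum(g[i] for i in idx)
    H = sum(h[i] for i in idx)
    return -G / (H + lam)


def split_gain(g, h, left, right, lam, gamma):
    """Loss reduction obtained by splitting a node into `left` and `right`."""
    def score(idx):
        G = sum(g[i] for i in idx)
        H = sum(h[i] for i in idx)
        return G * G / (H + lam)
    return 0.5 * (score(left) + score(right) - score(left + right)) - gamma


# Toy data: two negative and two positive items, initial probability 0.5 each.
y_true = [0, 0, 1, 1]
p_pred = [0.5, 0.5, 0.5, 0.5]
g, h = grad_hess_logistic(y_true, p_pred)

left, right = [0, 1], [2, 3]          # a split that separates the classes
w_left = leaf_weight(g, h, left, lam=1.0)
w_right = leaf_weight(g, h, right, lam=1.0)
gain = split_gain(g, h, left, right, lam=1.0, gamma=0.0)
print(w_left, w_right, gain)
```

The leaf holding the negative items receives a negative weight and the leaf holding the positive items a positive one, and the gain is positive because the split separates the two classes.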
The CNN-XGBoost algorithm has time complexity $O(L d^{2} m n p q) + O(r(Kt + \log B))$, which
reduces to $O(L d^{2} m n p q)$, where $L$ is the number of layers, $d$ is the number of input
or output channels, $m \times n$ is the size of the data matrix, $p \times q$ is the size of
the filter, $r = \|x\|_0$ is the number of entries that are not missing, $K$ is the number of
trees, $t$ is the tree depth, and $B$ is the block length.
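
Steps b and c of the algorithm (extending a feature vector so that it fits a square matrix, then converting it to a single-channel tensor) can be sketched as follows. The helper name `to_square_tensor` and the choice of zero as the padding value are assumptions; the source states only that components are added if required.

```python
import math


def to_square_tensor(features, pad_value=0.0):
    """Pad `features` to the next perfect square and reshape to (side, side, 1)."""
    n = len(features)
    side = math.isqrt(n)
    if side * side < n:
        side += 1                      # smallest side with side * side >= n
    padded = list(features) + [pad_value] * (side * side - n)
    # Nested lists stand in for a tensor; the innermost list is the channel axis.
    return [[[padded[r * side + c]] for c in range(side)] for r in range(side)]


tensor = to_square_tensor([1, 2, 3, 4, 5, 6, 7])   # 7 features -> 3 x 3 x 1
for row in tensor:
    print(row)
```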


3. METHODOLOGY 
The research methodology followed in this study is as follows.
a. Literature study
At this stage, materials, information, and theory were collected from books, articles,
journals, and other scientific works related to the object of research and the algorithms used.
b. Data collection
Data collection was carried out to obtain the data used in the study. The dataset is a breast
cancer histopathology dataset consisting of 162 slide images of breast cancer (BCa) specimens
scanned at 40x magnification. From these, 277,524 patches of 50 x 50 pixels were extracted
(198,738 negative and 78,786 positive for invasive ductal carcinoma (IDC)). The study therefore
uses 2 classes, negative and positive.
c. Research stages 
— Data augmentation stages
Data augmentation is performed in real time during CNN training. It exposes the CNN to many
variations and increases the amount of relevant data [25]. This research uses
ImageDataGenerator from Keras, which is applied during the training process. With this type of
augmentation, the network sees new variations in each epoch. On-the-fly augmentation proceeds
as follows: a) ImageDataGenerator receives a batch of images to be used for training, b) it
applies a series of random changes to each image in the batch (including random rotation,
resizing, and shifting), c) it replaces the original batch with the randomly transformed
batch, and d) the CNN is trained on the transformed batch (the original data itself is not
used for training). In each epoch, ImageDataGenerator transforms the existing images for
training, so the number of images in each epoch equals the number of original images. This
research uses transformations such as featurewise_center, rotation_range, width_shift_range,
height_shift_range, and horizontal_flip. Transformations are added as accuracy increases.
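
The on-the-fly flow in steps a) to d) can be sketched without any deep learning library. This is not the Keras ImageDataGenerator itself: the generator below is a hypothetical stand-in that uses only random horizontal and vertical flips in place of the rotations, resizing, and shifts named above, and images are represented as nested lists.

```python
import random


def hflip(img):
    """Mirror an image left to right (returns a new image)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror an image top to bottom (returns a new image)."""
    return img[::-1]


def augmented_batches(images, batch_size, rng):
    """Yield one epoch of batches; every image is randomly flipped on the fly."""
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]      # a) take a batch
        new_batch = []
        for img in batch:                             # b) random changes
            if rng.random() < 0.5:
                img = hflip(img)
            if rng.random() < 0.5:
                img = vflip(img)
            new_batch.append(img)
        yield new_batch                               # c) transformed batch replaces the original


rng = random.Random(0)
images = [[[1, 2], [3, 4]] for _ in range(5)]
for epoch in range(2):
    n_seen = 0
    for batch in augmented_batches(images, batch_size=2, rng=rng):
        n_seen += len(batch)                          # d) a model would be trained here
    print(f"epoch {epoch + 1}: {n_seen} images")
```

Each epoch still processes as many images as the original dataset contains; only their appearance changes between epochs, and the stored originals are left untouched.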
— Transfer learning stages
Transfer learning refers to using knowledge from previously trained models to accomplish new
tasks. In this research, transfer learning is used as a feature extractor. It can reduce
training time, computation, and the use of available hardware resources [20].
— XGBoost stages
In this study, XGBoost is used for the classification stage. After the CNN training stage,
XGBoost replaces the CNN's softmax output layer and is trained on the features learned by the
CNN. Finally, the CNN-XGBoost model produces new classification results on the test images [21].
— System planning
The system was designed as a tool for testing the CNN technique with transfer learning, data
augmentation, and the XGBoost classifier in the early detection of breast cancer.


4. RESULTS AND DISCUSSION 
4.1. Results obtained 

This part is divided into two stages: the model concept and the system implementation. The
model concept describes how the model is built to obtain accuracy results and then saved for
use in the system implementation. Using data augmentation and transfer learning with the CNN
to improve the accuracy of breast cancer detection involves several stages. The authors first
developed a workflow that combines data augmentation and transfer learning with the CNN. In
this research, ResNet50 is the pretrained network used to extract image features from the
image dataset.

Feature extraction from the breast cancer histopathology images was carried out on a Google
Colaboratory server with a GPU runtime, connected to Google Drive for storage. Every image in
the dataset is converted into a fixed-size feature vector. Feature extraction uses ResNet50, an
architecture pretrained on ImageNet, to create the ResNet50 model. The model also uses
ImageDataGenerator for data augmentation during training. After accuracy results are obtained
through transfer learning, this research links the final layer to the XGBoost classifier.

Bulletin of Electr Eng & Inf, Vol. 11, No. 2, April 2022: 803-813, ISSN: 2302-9285

The first step in running the modeling process on Google Colab is to install the Kaggle
package, followed by uploading the Kaggle API token so that the dataset can be downloaded from
Kaggle. The dataset is divided into training and testing sets with a ratio of 70:30, using the
train_test_split function from the scikit-learn library with a test size of 0.3. The function
assigns 30% of the dataset to testing and uses the rest for training. The split yields 194,266
training images and 83,258 testing images, as detailed in Table 1.


Table 1. Details of test data and training data
                 Positive    Negative
Training data    55,150      139,116
Testing data     23,636      59,622
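
The figures in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes that the test-set size is obtained by rounding 30% of the dataset up to the next integer, which reproduces the reported split; the variable names are illustrative.

```python
import math

# Class counts reported for the full dataset.
n_negative, n_positive = 198_738, 78_786
n_total = n_negative + n_positive

# 70:30 split; assumption: the test size is rounded up.
test_size = 0.3
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# Counts from Table 1.
table1 = {
    "train": {"positive": 55_150, "negative": 139_116},
    "test": {"positive": 23_636, "negative": 59_622},
}

print("total:", n_total, "train:", n_train, "test:", n_test)
print("positive share in training data:",
      round(table1["train"]["positive"] / n_train, 4))
```

The row sums of Table 1 match the split sizes, and the column sums match the class totals of the dataset. Positive patches make up roughly 28% of the training data, which is the imbalance addressed in the next step.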


The training dataset is balanced so that the model can learn both classes effectively. In this
research, the positive-class training data were augmented from 55,150 to 139,125 images using
ImageDataGenerator. Training and test batches were created to facilitate the training stage,
with a batch size of 500 for training and validation. Data augmentation uses
ImageDataGenerator from the Keras library with three arguments: horizontal_flip (True),
vertical_flip (True), and fill_mode (nearest). The number of images generated by
ImageDataGenerator depends on the number of epochs. Figure 1 shows sample images before and
after the data augmentation process.


[Image panels: before augmentation; after augmentation]


Figure 1. Image results before and after augmentation 


The next step is transfer learning. This work employs the ResNet50 architecture pretrained on
ImageNet: a network with the ResNet50 architecture but without the fully connected top layer
is constructed, and the pretrained weights are downloaded. Keras provides several popular
ImageNet models. Because the ResNet50 weights were trained on the ImageNet dataset, they can
recognize colors and textures. The ResNet50 weights are therefore applied to all images in the
dataset to extract features. The input size of the classification model is taken from the
final vector of the feature extraction process, which has length 2048. The next process is
training and validation. Training uses the 70% of the data selected at random by
train_test_split with a test size of 0.3. The Adam optimizer was used to train the model for
10 epochs, as seen in Figure 2.

The model was trained using the ResNet50 architecture. Loss and accuracy were measured on the
training and validation datasets over the 10 epochs. After training, the training and
validation accuracy of the CNN are plotted, as shown in Figure 3.


Integration of convolutional neural network and extreme gradient boosting for breast ... (Endang Sugiharti) 




Epoch 1/10  - 389/389 - 325s 837ms/step - loss: 0.2832 - accuracy: 0.8813 - val_loss: 0.3244 - val_accuracy: 0.8600
Epoch 2/10  - 389/389 - 245s 631ms/step - loss: 0.2408 - accuracy: 0.8979 - val_loss: 0.3138 - val_accuracy: 0.8659
Epoch 3/10  - 389/389 - 239s 615ms/step - loss: 0.2320 - accuracy: 0.9015 - val_loss: 0.3098 - val_accuracy: 0.8671
Epoch 4/10  - 389/389 - 239s 614ms/step - loss: 0.2254 - accuracy: 0.9044 - val_loss: 0.3042 - val_accuracy: 0.8702
Epoch 5/10  - 389/389 - 239s 614ms/step - loss: 0.2232 - accuracy: 0.9059 - val_loss: 0.3002 - val_accuracy: 0.8727
Epoch 6/10  - 389/389 - 238s 613ms/step - loss: 0.2185 - accuracy: 0.9072 - val_loss: 0.3003 - val_accuracy: 0.8716
Epoch 7/10  - 389/389 - loss: 0.2190 - accuracy: 0.9070 - val_loss: 0.2978 - val_accuracy: 0.8729
Epoch 8/10  - 389/389 - loss: 0.2177 - accuracy: 0.9070 - val_loss did not improve from 0.29782 - val_accuracy: 0.8713
Epoch 9/10  - 389/389 - 239s 614ms/step - loss: 0.2159 - accuracy: 0.9091 - val_loss: 0.2943 - val_accuracy: 0.8753
Epoch 10/10 - 389/389 - 239s 614ms/step - loss: 0.2135 - accuracy: 0.9100 - val_loss: 0.2932 - val_accuracy: 0.8748

val_loss improved in epochs 1-5, 7, 9, and 10 (from inf to 0.32441, 0.31377, 0.30978, 0.30424,
0.30019, 0.29782, 0.29432, and 0.29321); each time the model was saved to
breast_histopathology_transfer_best.hdf5. val_loss did not improve in epochs 6 and 8.

Figure 2. Accuracy result of training and validation of ResNet50 model 
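
The 389 steps per epoch shown in the training log are consistent with the training-set size reported earlier, under the assumption that the figure of 500 mentioned in the text is the batch size and that the 194,266 training images from the 70:30 split are used:

```python
import math

n_train = 194_266      # training images after the 70:30 split
batch_size = 500       # assumption: the "500" in the text is the batch size

# The last batch is smaller than 500, so the division is rounded up.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch)
```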


[Plots: "Training and Validation Accuracy" (y-axis: accuracy) and "Training and Validation Loss" (y-axis: loss), each over epochs 1-10 with training and validation curves]


Figure 3. Graph of training and validation accuracy and training and validation loss model of ResNet 50 


Figure 4 shows the classification report produced after training and validating the ResNet50
model, which reaches an accuracy of 88%. After training and validation, the last layer of the
model is taken and integrated with the XGBoost classifier, using the XGBoost library. Once
fitted, the XGBoost classifier yields training and test scores: 92% for the training score and
90% for the test score. The model is further analyzed using a confusion matrix, as in Figure 5.
                precision   recall   f1-score   support
negative        0.91        0.92     0.91       59622
positive        0.79        0.76     0.77       23636
accuracy                             0.88       83258
macro avg       0.85        0.84     0.84       83258
weighted avg    0.87        0.88     0.87       83258

Figure 4. Classification accuracy for the ResNet50 model

                precision   recall   f1-score   support
Negative        0.91        0.93     0.92       4961
Positive        0.93        0.90     0.92       5039
accuracy                             0.92       10000
macro avg       0.92        0.92     0.92       10000
weighted avg    0.92        0.92     0.92       10000

Figure 5. Training features report for the ResNet50 model with XGBoost classifier
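
The precision, recall, and F1-score columns in these reports are derived from a confusion matrix. The sketch below shows the calculation for the positive class; the four counts are hypothetical values chosen for illustration and are not taken from the study.

```python
def binary_report(tp, fp, fn, tn):
    """Precision, recall, F1 (positive class) and accuracy from confusion counts."""
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical confusion-matrix counts (not from the paper).
report = binary_report(tp=90, fp=10, fn=30, tn=870)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

With an unbalanced test set, accuracy can stay high while recall on the positive class is noticeably lower, which is why the per-class rows are reported alongside the overall accuracy.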


The outcome of this study is summarized by the accuracy values in Table 2. Table 2 shows that
XGBoost increases the accuracy of the ResNet50 model. Validation accuracy of the CNN with
ResNet50 rose from 88% to 90% after adding the XGBoost classifier, while training accuracy
rose from 91% to 92%. These high accuracy values are attributed to the good quality of the
images.


Table 2. Results of loss and accuracy values
                     Training data accuracy    Validation data accuracy
CNN+ResNet50         0.91                      0.88
CNN+ResNet50+XGB     0.92                      0.90


4.2. System implementation stage 

At this stage, software was built to test the approach used in this research. The user
interface was created using HTML, CSS, and JavaScript, while the Django platform and the
Python language are used for analysis. The previously trained model is saved to a file with
the .h5 extension for later use in the application. The application has 3 menus, namely home,
what is, and about us. The home menu shows the research title and the method used. The design
of the home menu can be seen in Figure 6.


[Screenshot: "Universitas Negeri Semarang" header with the banner "Welcome To Breast Cancer Classification System"]


Figure 6. Home application display 


Selecting "predict now" opens the image upload view, where the user selects the image to be
classified. The image upload display can be seen in Figure 7. After the image is chosen and
"predict" is selected, the prediction results appear as in Figure 8.


[Screenshot: "Universitas Negeri Semarang" header with the form "Upload Breast Cancer Image"]


Figure 7. Display for image uploading 

[Screenshot: "The Result of Prediction" page]


Figure 8. Display of prediction results 


4.3. Discussion 

This study applies transfer learning and the XGBoost classifier to optimize the CNN and
improve the accuracy of breast cancer detection from histopathology images. The implementation
of this combination shows the accuracy improvement of CNN-XGBoost in diagnosing breast cancer.
The dataset was obtained from Kaggle, with the class distribution shown in Figure 9.

Figure 9. Class distribution dataset

Figure 9 also shows that the dataset is highly unbalanced between classes 0 and 1, where 1
denotes the positive class and 0 the negative class. This large difference in the amount of
data can affect training, because the model learns more about the negative class than the
positive class. For this reason, the training data were balanced by adding data to the
positive class. A total of 55,150 new images were generated using ImageDataGenerator, so this
study used 2 data augmentation processes: the first balances the training data and the second
is applied during training. Balancing is carried out after the data are split into 70%
training data and 30% testing data.

The next stage is to create training and test batches for the training process. Training and
validation then use the ResNet50 model, which has been pretrained; reusing it in this way is
known as transfer learning. In this study, transfer learning serves as the feature extractor,
and the weights are derived from ImageNet [24]. After the transfer learning model is built, it
is compiled so that it can be trained.

In this step, the model is configured for training with the following settings: 1) loss, the
method of measuring the loss value, here based on categorical values; 2) optimizer, the
algorithm that updates weights and biases during neural network training to minimize the
error, that is, the difference between the network output and the target. The Adam optimizer
combines RMSProp and momentum and has several advantages, including being computationally
efficient, memory efficient, and suitable for a wide range of optimization problems in machine
learning; 3) metrics, for which this study uses accuracy as the measured value.
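
How Adam combines a momentum term with an RMSProp-style scaling can be seen in a minimal scalar sketch. The update rule follows the commonly published form of Adam with its usual default decay rates; the learning rate, the number of steps, and the quadratic test function are assumptions chosen for illustration and are unrelated to the training configuration of this study.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient `grad`."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_min, 3))
```

Only two running averages per parameter are stored, which is the source of the memory efficiency mentioned above.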

In the next stage, data augmentation presents new versions of the dataset during training.
With data augmentation, the data seen during training differ from the original dataset,
because each batch is transformed according to the selected functions. Augmentation is done
using ImageDataGenerator from the Keras package [3]. A study comparing and analyzing data
augmentation methods in image classification concluded that traditional transformations have
more impact than other augmentation methods. The data augmentation functions used in this
study are shown in Table 3.


Table 3. Arguments and data augmentation functions
Argument           Function                        Value
Horizontal flip    Flip the image horizontally     True
Vertical flip      Flip the image vertically       True
Fill mode          Fill in the empty area          Nearest


As Table 3 shows, three arguments are applied to each image, so each image differs slightly
from the others because of the horizontal and vertical flips. This helps the model recognize a
larger variety of images and makes it more effective. Augmentation works in real time during
the training process. It is not applied during validation, because validation evaluates the
trained model on the original images, so the validation images should not be altered. An
example of images that have gone through data augmentation can be seen in Figure 10.




Figure 10. Image results after going through data augmentation 


The next stage is training and validation to obtain accuracy results for the transfer learning
model. The output of the last layer of the trained model is then passed to the XGBoost
classifier. The XGBoost classifier classifies the features produced by the trained ResNet50
model, giving new training and validation accuracy results.

Validation accuracy increased by 2 percentage points, from 88% to 90%, when the XGBoost
classifier was used, as shown in the training and validation results. The model is then stored
for use in the test system. This is consistent with the findings of [7], which showed that
transfer learning with the GoogleNet architecture can increase model accuracy. The
contribution of this approach is to establish that data augmentation, transfer learning, and
the XGBoost classifier in the CNN method can improve the accuracy of breast cancer detection
with the ResNet50 architecture, and it can serve as a reference for other researchers
conducting breast cancer detection studies.


5. CONCLUSION 

The following conclusions are drawn from the results: i) the stages of CNN-XGBoost
implementation with data augmentation and transfer learning were identified. They begin with
dataset collection, followed by preprocessing and splitting the data into 70% training data
and 30% testing data. Data augmentation is then carried out using ImageDataGenerator from the
Keras package, and transfer learning serves as the feature extractor with weights from
ImageNet. Next comes training and validation to obtain the accuracy of each model. The output
of the last layer of the trained model is then passed to the XGBoost classifier, which
classifies the features produced by the trained ResNet50 model, giving new training and
validation accuracy results. The model is then stored for use in the test system; ii) the
validation accuracy of the CNN-XGBoost integration on histopathological images for breast
cancer detection increased by 2 percentage points, from 88% to 90%. Directions for future
research are: i) because detection accuracy is expected to increase with image quality, the
approach can be tried on other types of images and ii) better results may be obtained by
increasing the size of the dataset, extending data augmentation with additional parameters, or
trying other architectural models.